METHOD FOR FINDING LOCAL EXTREMA OF A SET 
OF VALUES FOR A PARALLEL PROCESSING ELEMENT 



CROSS-REFERENCE TO RELATED APPLICATIONS 

[0001] The present application is related to U.S. Patent Application Serial No.
entitled "Method for Finding Global Extrema of a Set of Bytes Distributed Across an Array of
Parallel Processing Elements" filed (DB00 1077-000, Micron no. 03-0053), and U.S.
Patent Application Serial No. entitled "Method for Finding Global Extrema of a
Set of Shorts Distributed Across an Array of Parallel Processing Elements" filed
(DB00 1078-000, Micron no. 03-0054).

BACKGROUND OF THE INVENTION 

[0002] The present invention relates generally to parallel processing and more particularly to 
determining an extrema (e.g., maximum or minimum) from a set of values within a single
processing element of a parallel processing system. 

[0003] Conventional central processing units ("CPU's"), such as those found in most
personal computers, execute a single program (or instruction stream) and operate on a single
stream of data. For example, the CPU fetches its program and data from a random access
memory ("RAM"), manipulates the data in accordance with the program instructions, and
writes the results back sequentially. There is a single stream of instructions and a single
stream of data (note: a single operation may operate on more than one data item, as in X = Y
+ Z; however, only a single stream of results is produced). Although the CPU may determine
the sequence of instructions executed in the program itself, only one operation can be
completed at a time. Because conventional CPUs execute a single program (or instruction
stream) and operate on a single stream of data, a conventional CPU may be referred to as a
single-instruction, single-data CPU or an SISD CPU.

[0004] The speed of conventional CPUs has dramatically increased in recent years. 
Additionally, the use of cache memories gives conventional CPUs faster access to the
desired instruction and data streams. However, because conventional CPUs can complete only
one operation at a time, conventional CPUs are not suitable for extremely demanding 
applications having large data sets (such as moving image processing, high quality speech 
recognition, and analytical modeling applications, among others). 
[...] remains within register M2 because flag C was set equal to zero (0) when byte-2 was
subtracted from byte-4.

[0059] The above process is repeated until the (odd) extrema of the set of odd numbered
bytes is found. Then, in operation 78, the local extrema for the PE is found by comparing the
odd extrema to the even extrema.

For example, the value of the even extrema (i.e., .0) is subtracted from the value of the odd
[...]

[0062] In the current embodiment, the odd extrema in M1 is loaded into register R0. The
extrema in M2 is subtracted from R0. A flag C is generated. The flag C is used to [...]

[...] noted that the local extrema may be returned from register M2 to the register file 42 or
sent to the X register, among others.

[0064] As previously mentioned, the discussion of operational process 70 was limited to
finding the local extrema for a PE having a set of only four values (i.e., byte-1 through
byte-4). It should be noted, however, that operational process 70 may be scaled for any
number of values stored on the PE. For example, FIG. 5 is a graphical representation of the
results obtained using operational process 70 for a set of values having greater than four (4)
bytes according to an embodiment of the present invention.

[0065] Referring to FIG. 5, it should be noted that each row of the first several rows
represents the results obtained using operational process 70 after one cycle, or clock pulse.
The first row, for example, represents the results obtained after operation 71 is completed,
wherein the address for the first odd and even numbered bytes (i.e., byte-1 and byte-2,
respectively) is selected from the register file 42. Similarly, the second row represents the
results obtained after operation 72 is completed, wherein the address for the next odd
numbered byte (i.e., byte-3) is selected from the register file 42. Similarly, the third and
fourth rows represent the results of operations 73 and 74, respectively.

[0066] Referring to the fourth row, in addition to the results discussed above in conjunction
with operation 75, the address of the next odd byte in the set (i.e., byte-5) is selected from
register file 42 for a PE having more than four values in the set. Likewise, referring to the
fifth row, in addition to the results discussed above in conjunction with operation 76, the
value of byte-4 is loaded into register R2 and the address for the next even byte (i.e.,
byte-6) is selected from register file 42. Values for each odd and even numbered byte will be
alternately fed through the PE. As operational process 70 continues, the value of each
subsequent odd numbered byte is compared to the value of the odd extrema that is saved in
register M1 and any updated odd extrema is saved back into M1. Additionally, the value of
each subsequent even numbered byte is compared to the even extrema that is saved in register
M2 and any updated even extrema is saved back into M2. The odd extrema and even extrema
are not compared to each other (i.e., as in operation 78) until the extrema for all of the odd
bytes in the set and the extrema for all of the even bytes in the set have been finally
determined. Thus, operational process 70 can be scaled for any number of bytes within a set.
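
The scaling described above can be illustrated with a minimal software sketch. It is not the
hardware implementation: the function name local_extrema, the find_max parameter, and the
variables m1 and m2 (standing in for registers M1 and M2) are assumptions made for
illustration only. The sketch keeps one running extrema per stream and compares the two only
once, after every value has been seen.

```python
def local_extrema(values, find_max=True):
    """Return the extrema of a non-empty sequence using odd/even streams.

    Hypothetical software model only; m1 and m2 stand in for registers M1 and M2.
    """
    if not values:
        raise ValueError("the set of values must not be empty")

    # 'better(a, b)' is True when a should replace b as the running extrema.
    if find_max:
        better = lambda a, b: a > b
    else:
        better = lambda a, b: a < b

    m1 = None  # running extrema of the odd stream (byte-1, byte-3, ...)
    m2 = None  # running extrema of the even stream (byte-2, byte-4, ...)

    for position, value in enumerate(values, start=1):
        if position % 2 == 1:
            if m1 is None or better(value, m1):
                m1 = value
        else:
            if m2 is None or better(value, m2):
                m2 = value

    # Final step: the odd and even extrema are compared only once, at the end.
    if m2 is None:
        return m1
    return m1 if better(m1, m2) else m2
```

For instance, local_extrema([3, 9, 7, 2, 8, 250]) returns 250, and the same call with
find_max=False returns 2.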

[0067] As discussed above, the length of the value may also be scaled while remaining
within the scope of the current invention. For example, in an alternative embodiment,
operational process 70 is employed for finding the extrema of local shorts, wherein a short
refers to a 16-bit value. As discussed above, the PE is an 8-bit processing element, thus each
short requires two cycles to be processed. In the current alternative embodiment, each short
is processed as two separate bytes, a most significant (MS) byte and a least significant (LS)
byte. The convention used in the current alternative embodiment is known as "big-endian",
that is, the MS byte is stored in the LS-register file address.
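
A minimal sketch of how a 16-bit comparison can be carried out with 8-bit operations is given
here for illustration. The helper names split_short, sub8, and short_less_than are
hypothetical, and the use of a borrow passed from the LS subtraction into the MS subtraction
is an assumption about how the two byte-wide comparisons combine; the text above does not
spell out that detail.

```python
def split_short(value):
    """Split a 16-bit value into (ms_byte, ls_byte)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    return (value >> 8) & 0xFF, value & 0xFF


def sub8(a, b, borrow_in=0):
    """8-bit subtraction a - b - borrow_in; returns (result, borrow_out)."""
    diff = a - b - borrow_in
    return diff & 0xFF, 1 if diff < 0 else 0


def short_less_than(a, b):
    """Compare two 16-bit values using two 8-bit subtractions.

    The LS bytes are subtracted first; the resulting borrow feeds the MS
    subtraction (an assumed mechanism). A final borrow of 1 means a < b.
    """
    a_ms, a_ls = split_short(a)
    b_ms, b_ls = split_short(b)
    _, borrow = sub8(a_ls, b_ls)
    _, borrow = sub8(a_ms, b_ms, borrow)
    return borrow == 1
```

For instance, split_short(0x1234) returns (0x12, 0x34), and short_less_than(0x00FF, 0x0100)
is True even though the LS byte of the first value is the larger of the two LS bytes.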

[0068] As the shorts are read from the register file, they are separated into the odd and
even data streams as discussed above. However, in the current alternative embodiment, both
registers M0 and M1 are used to hold a short within the even data stream (LS even byte in M0
and MS even byte in M1) and both registers M2 and M3 are used to hold a short within the
odd data stream (LS odd byte in M2 and MS odd byte in M3). Additionally, registers Q0 and
Q1 are used to conditionally store the even extrema LS byte and the even extrema MS byte,
respectively, and registers Q2 and Q3 are used to conditionally store the odd extrema LS byte
and the odd extrema MS byte, respectively.

[0069] The shorts are loaded into the ALU and a comparison of the odd extrema LS byte and
MS byte to the next odd LS byte and next odd MS byte, respectively, is completed. Likewise,
a comparison of the even extrema LS byte and MS byte to the next even LS byte and next
even MS byte, respectively, is completed. A comparison of the LS and MS odd extrema
bytes to the LS and MS even extrema bytes is then completed to determine a local extrema for
the PE.

[0070] In the following embodiment, short-1, short-3, etc. form the stream of odd numbered
values, and short-2, short-4, etc. form the stream of even numbered values. The movement of
the shorts throughout the PE can be divided into a series of clock pulses, or cycles. It should
be noted that all operations within a cycle happen simultaneously. Thus, if a register is being
"read from" and "written to" in the same cycle, the "old" data moves out of the register at the
same time that the "new" data moves into the register. Accordingly, the old data is not lost.
In the following example, the old contents of a particular register will be that value written to
the register during the cycle immediately preceding the current cycle. If a value was not
written to the particular register during the cycle immediately preceding the current cycle, the
old contents of the register will be the last value written to the register during the closest
preceding cycle to the current cycle. During processing, certain actions take place in each
cycle.
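
The simultaneous read and write behavior described above can be modeled in software by
taking a snapshot of every register before any register is updated. The following is a minimal
sketch under that assumption; the function name step and the dictionary-based register model
are hypothetical and chosen only for illustration.

```python
def step(registers, transfers):
    """Apply one cycle of register transfers and return the new register state.

    registers: dict mapping register name -> current value.
    transfers: dict mapping destination register -> source, where the source
               is either a register name (str) or a literal value (int)
               arriving from outside the modeled registers.
    Every source is read from the OLD state, so a register can be read from
    and written to in the same cycle without its old value being lost.
    """
    old = dict(registers)   # snapshot taken before any write happens
    new = dict(registers)   # registers not written this cycle keep their value
    for destination, source in transfers.items():
        new[destination] = old[source] if isinstance(source, str) else source
    return new
```

For instance, with R1 holding 0x34, a cycle described by {"M2": "R1", "R1": 0x12} leaves
0x34 in M2 and 0x12 in R1, regardless of the order in which the two transfers are listed.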

[0071] In the first cycle, the LS byte of short-1 is loaded into register R1. It should be noted
that the "first cycle" assumes that the LS and MS bytes have already been read from the
register file.

[0072] In the second cycle, the LS byte of short-1 is loaded from register R1 into register
M2 and the MS byte of short-1 is loaded into register R1.

[0073] In the third cycle, the Q2 register (which holds the LS byte for the odd extrema) is
initialized with M2 (the LS byte of the first short for the odd stream), the MS byte of short-1
is loaded from register R1 into register M3, and the LS byte of short-2 is loaded into register
R1.

[0074] In the fourth cycle, the Q3 register (which holds the MS byte for the odd extrema) is
initialized with M3 (the MS byte of the first short for the odd stream), the LS byte of short-2
is loaded from register R1 into register M0, and the MS byte of short-2 is loaded into register
R1.

[0075] In the fifth cycle, the Q0 register (which holds the LS byte for the even extrema) is
initialized with M0 (the LS byte of the first short for the even stream), the MS byte of short-2
is loaded from register R1 into register M1, and the LS byte of short-3 is loaded into R1.

[0076] In the sixth cycle, the values of Q2 and R1 are compared, the Q1 register (which
holds the MS byte for the even extrema) is initialized with M1 (the MS byte of the first short
for the even stream), the LS byte of short-3 is loaded from register R1 into register M2, and
the MS byte of short-3 is loaded into R1.

[0077] In the seventh cycle, the values of registers Q3 and R1 are compared, the MS byte of
short-3 is loaded from register R1 into register M3, and the LS byte of short-4 is loaded into
R1.



[0078] In the eighth cycle, the contents of register Q2 are conditionally updated with M2, the
values of Q0 and R1 are compared, the LS byte of short-4 is loaded from register R1 into
register M0, and the MS byte of short-4 is loaded into R1.

[0079] In the ninth cycle, the contents of register Q3 are conditionally updated with M3, the
values of Q1 and R1 are compared, and the MS byte of short-4 is loaded from register R1 into
register M1.

[0080] In the tenth and eleventh cycles, the contents of registers Q0 and Q1 are
conditionally updated with M0 and M1, respectively.

[0081] It should be recognized that the above-described embodiments of the invention are
intended to be illustrative only. For example, the architecture could be scaled to find the
extrema of any sized input value, e.g., a 4 byte value ("long"), or an 8 byte value ("long long").
Thus, to cope with an 8 byte value, the architecture would need to be extended to 16 Q registers
and 16 M registers. Numerous alternative embodiments may be devised by those skilled in
the art without departing from the scope of the following claims.
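
The register counts mentioned above suggest a simple scaling rule, sketched here for
illustration. The rule (two registers per byte of value width in each of the M and Q banks,
one for the odd stream and one for the even stream) is an assumption inferred from the two
cases given in this description: 2 byte shorts use M0 through M3 and Q0 through Q3, and
8 byte values call for 16 M registers and 16 Q registers. The function name
registers_needed is hypothetical.

```python
def registers_needed(value_bytes):
    """Estimate the M and Q register counts for a given value width in bytes.

    Assumed rule: each bank holds one register per byte for the odd stream
    and one per byte for the even stream, i.e. 2 * value_bytes per bank.
    """
    if value_bytes < 1:
        raise ValueError("value width must be at least one byte")
    per_bank = 2 * value_bytes
    return {"M": per_bank, "Q": per_bank}
```

For instance, registers_needed(2) gives 4 registers per bank, matching the short embodiment,
and registers_needed(8) gives 16 per bank.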